@anicusan
Member

Added a second `accumulate` algorithm, `ScanPrefixes`, which uses coupled lookback over pre-scanned prefixes at the cost of one extra kernel launch. `ScanPrefixes` is now the default on Metal.

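This fixes the decoupled-lookback issue on Metal, where `accumulate` sometimes fails because `@synchronize` gives weaker guarantees than on other platforms.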
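
For readers unfamiliar with the idea, here is a minimal CPU sketch of a scan that works from pre-scanned block prefixes. It is not the kernel code from this PR: `blockwise_accumulate` and its arguments are hypothetical names, the loops over blocks stand in for GPU blocks, and `op` is assumed to be associative. Pass 2 plays the role of the extra kernel launch.

```julia
# Hypothetical CPU illustration of a three-pass, prefix-based scan.
# Assumes `op` is associative and `init` has the same type as the elements of `v`.
function blockwise_accumulate(op, v::AbstractVector, init; block_size::Int=4)
    n = length(v)
    out = similar(v)
    num_blocks = cld(n, block_size)
    prefixes = Vector{eltype(v)}(undef, num_blocks)

    # Pass 1: inclusive scan inside each block, recording each block's total.
    for b in 1:num_blocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        acc = v[lo]
        out[lo] = acc
        for i in lo+1:hi
            acc = op(acc, v[i])
            out[i] = acc
        end
        prefixes[b] = acc
    end

    # Pass 2 (the extra launch): scan the block totals, so that
    # prefixes[b] holds the reduction of blocks 1 through b.
    for b in 2:num_blocks
        prefixes[b] = op(prefixes[b - 1], prefixes[b])
    end

    # Pass 3: each block combines the scanned prefix of the previous block
    # (and `init`) with its local results.
    for b in 1:num_blocks
        lo = (b - 1) * block_size + 1
        hi = min(b * block_size, n)
        offset = b == 1 ? init : op(init, prefixes[b - 1])
        for i in lo:hi
            out[i] = op(offset, out[i])
        end
    end
    return out
end

# Usage: compare against Base on the CPU.
v = rand(1:100, 1_000)
@assert blockwise_accumulate(+, v, 0; block_size=64) == accumulate(+, v; init=0)
```

Because every prefix is fully computed before pass 3 reads it, no block has to wait on a neighbour's in-flight result, which is the part that depends on the synchronization guarantees.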

@anicusan
Member Author

We're still faster than the current default Metal accumulate:

```julia
using BenchmarkTools
using Metal
import AcceleratedKernels as AK

using Random
Random.seed!(0)

# AcceleratedKernels accumulate; synchronize so the timing includes the GPU work
function akacc(v)
    va = AK.accumulate(+, v, init=zero(eltype(v)), block_size=1024)
    Metal.synchronize()
    va
end

# Default accumulate for MtlArray, synchronized the same way
function baseacc(v)
    va = accumulate(+, v, init=zero(eltype(v)))
    Metal.synchronize()
    va
end

v = MtlArray(rand(1:100, 10_000_000))

# Correctness checks
va = akacc(v) |> Array
vb = baseacc(v) |> Array
@assert va == vb

# Benchmarks
println("Base vs AK")
display(@benchmark baseacc($v))
display(@benchmark akacc($v))
```

And timings (the first trial is Base, the second is AK):

```
julia> include("accumulate_benchmark.jl")
Base vs AK
BenchmarkTools.Trial: 603 samples with 1 evaluation.
 Range (min … max):  3.369 ms … 52.091 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.746 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.300 ms ±  6.782 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▄▁  ▁ ▅▇▅▂                                                
  █████████████▇▅▇▇▆▇▅▆▅▄▅▄▄▁▄▁▁▅▁▅▁▆▆▄▆▄▇▅▄▄▄▄▁▁▁▁▁▁▁▁▁▄▄▅▆ ▇
  3.37 ms      Histogram: log(frequency) by time     36.9 ms <

 Memory estimate: 45.41 KiB, allocs estimate: 1568.
BenchmarkTools.Trial: 644 samples with 1 evaluation.
 Range (min … max):  4.535 ms … 35.595 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.928 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.770 ms ±  4.089 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅▁    ▁▂▁▂▄▄▃▂                                             
  ███▇▆▁▆██████████▇▄▆▆▇▆▇▅▇▅▆▁▁▅▁▁▁▄▄▁▁▄▁▁▄▄▁▁▁▅▁▁▁▁▁▁▄▄▁▁▄ ▇
  4.53 ms      Histogram: log(frequency) by time     27.6 ms <

 Memory estimate: 16.63 KiB, allocs estimate: 565.
```
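
To make the comparison easier to read, the reported statistics can be turned into ratios. This is only a small sketch over the numbers printed in the two trials (Base first, AK second); the variable names are illustrative, and a single run with this much spread should be treated as indicative rather than conclusive.

```julia
# Times in ms, copied from the two BenchmarkTools trials in this comment.
base = (min = 3.369, median = 7.746, mean = 8.300)
ak   = (min = 4.535, median = 5.928, mean = 7.770)

# Ratio Base / AK per statistic: a value above 1 means AK was faster.
speedup = map(/, base, ak)

for (name, ratio) in pairs(speedup)
    println(rpad(name, 8), round(ratio, digits=2), "x")
end
```

On these numbers AK is ahead on the median and the mean, while Base has the lower minimum time.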

@anicusan anicusan merged commit 3e814ca into main Dec 23, 2024
31 of 32 checks passed

Successfully merging this pull request may close these issues.

accumulate on Metal sometimes fails due to weaker @synchronize guarantees than on other platforms
